
Conversation

qsang-nv (Collaborator) commented Sep 25, 2025

📌 Description

Add XQA FP8 MHA and FP8 KV cache support.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • FP8 acceleration for attention on newer GPU SMs (SM100, SM120) and multi-SM targeting
    • Optional paged KV cache layout and separate K/V cache inputs for improved memory handling
    • New low-level async tensor-map/TMA and matrix-descriptor primitives to boost GPU throughput
    • Public API now accepts SM selection, cache type, and workspace buffer parameters
  • Bug Fixes

    • Improved numerical stability for attention mask initialization
  • Tests

    • Expanded tests for FP8/FP16 modes, cache layouts, and additional SM targets
  • Documentation

    • Added API docs for the XQA attention entrypoint

qsang-nv changed the title from "add xqa fp8 mha" to "add xqa fp8 mha and fp8 kv cache" on Sep 25, 2025
gemini-code-assist (Contributor)

Summary of Changes

Hello @qsang-nv, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the FlashInfer library by integrating FP8 support for both Multi-Head Attention computations and the Key-Value cache within the XQA framework. These changes are primarily aimed at leveraging the advanced capabilities of NVIDIA Hopper GPUs (SM90+) to achieve substantial performance and memory efficiency gains. The implementation includes new CUDA kernels utilizing GMMA and TMA, along with Python-side modifications to enable configurable FP8 execution paths, ensuring that users can opt into these optimizations while maintaining numerical stability.

Highlights

  • FP8 Multi-Head Attention (MHA) Support: Introduced support for FP8 (8-bit floating point) Multi-Head Attention, enabling more efficient computations on compatible hardware.
  • FP8 Key-Value (KV) Cache: Added functionality for using FP8 for the Key-Value cache, which can significantly reduce memory footprint and improve performance.
  • NVIDIA Hopper (SM90+) Optimizations: Integrated specialized Hopper GPU features like GMMA (General Matrix Multiply Accumulate) and TMA (Tensor Memory Accelerator) for optimized FP8 operations and efficient memory access patterns.
  • Configurable FP8 Execution: The XQA module generation and runtime execution now allow explicit control over whether FP8 MHA and FP8 KV cache are utilized, providing flexibility for different precision requirements.
  • Numerical Stability Adjustments: Modified the safeInitRowMax value and adjusted test tolerances to account for the numerical characteristics of lower precision FP8 computations, ensuring stability.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist (Contributor)

Summary of Changes

Hello @qsang-nv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the FlashInfer library by introducing support for FP8 Multi-Head Attention (MHA) and FP8 Key-Value (KV) cache. These additions leverage advanced features of NVIDIA Hopper GPUs, such as GMMA and TMA, to achieve higher performance and memory efficiency for large language model inference. The changes span the CUDA C++ backend, including new kernel implementations and memory management utilities, as well as updates to the Python AOT compilation and testing framework to ensure robust integration and validation of the new FP8 capabilities.

Highlights

  • FP8 MHA Integration: Introduced a new "run_fp8_mha" boolean parameter to the "xqa_wrapper" function, enabling conditional execution of FP8 Multi-Head Attention kernels.
  • FP8 KV Cache Support: Added "fp8_kv_cache" parameter to the AOT compilation and Python interface, allowing the use of FP8 for Key-Value cache storage.
  • New Low-Level CUDA Kernels: Incorporated "gmma.cuh" for Generic Matrix Multiply Accumulate (GMMA) operations and "tma.h" for Tensor Memory Accelerator (TMA) asynchronous memory transfers, leveraging Hopper (SM90) architecture features for optimized FP8 performance.
  • CUDA Tensor Map Utilities: Introduced "tensorMap.cpp" and "tensorMap.h" to facilitate efficient access and management of KV caches using CUDA Tensor Maps.
  • Numerical Stability Improvements: Adjusted "safeInitRowMax" in "utils.cuh" and modified attention mask application in "mha.cu" to enhance numerical stability, especially relevant for lower precision formats.
  • Expanded Testing: Updated "test_xqa.py" to include comprehensive tests for FP8 MHA and FP8 KV cache, with adjusted numerical tolerance checks to account for reduced precision.

gemini-code-assist (bot) left a comment:


Code Review

This pull request introduces support for FP8 multi-head attention (MHA) and FP8 KV cache within the XQA kernel, primarily targeting the NVIDIA Hopper architecture. This is a significant feature addition, enabled by new CUDA primitives for Hopper's Tensor Memory Access (TMA) and Grace Hopper MMA (GMMA) instructions. The changes are well-implemented, including new CUDA headers for hardware abstraction, a dispatch mechanism for the new FP8 kernel path, and corresponding updates to the Python build system and tests. The tests have been thoughtfully adjusted with relaxed tolerances for FP8 precision. My review includes one suggestion to refactor a small piece of duplicated code to enhance maintainability.

Comment on lines 37 to 72
  if (run_fp8_mha) {
    launchHopperF8MHAFlashInfer(
        multiProcessorCount, nbKHeads, slidingWinSize, qScale,
        reinterpret_cast<OutputHead*>(output.data_ptr()),
#if LOW_PREC_OUTPUT
        reinterpret_cast<float const*>(rcpOutScale.data_ptr()),
#endif
        reinterpret_cast<InputHead const*>(q.data_ptr()), attentionSinksPtr,
        reinterpret_cast<GMemCacheHead*>(pool.data_ptr()),
        reinterpret_cast<KVCachePageIndex const*>(kvCachePageList.data_ptr()), maxSeqLen,
        reinterpret_cast<uint32_t const*>(seqLen.data_ptr()), batchSize,
        reinterpret_cast<float const*>(kvCacheScale.data_ptr()),
#if SPEC_DEC
        qSeqLen, reinterpret_cast<uint32_t const*>(qCuSeqLens.data_ptr()),
        reinterpret_cast<MaskType const*>(mask.data_ptr()),
#endif
        reinterpret_cast<uint32_t*>(semaphores.data_ptr()),
        reinterpret_cast<void*>(scratch.data_ptr()), stream);
  } else {
    launchMHAFlashInfer(multiProcessorCount, nbKHeads, slidingWinSize, qScale,
                        reinterpret_cast<OutputHead*>(output.data_ptr()),
#if LOW_PREC_OUTPUT
                        reinterpret_cast<float const*>(rcpOutScale.data_ptr()),
#endif
                        reinterpret_cast<InputHead const*>(q.data_ptr()), attentionSinksPtr,
                        reinterpret_cast<GMemCacheHead*>(pool.data_ptr()),
                        reinterpret_cast<KVCachePageIndex const*>(kvCachePageList.data_ptr()),
                        maxSeqLen, reinterpret_cast<uint32_t const*>(seqLen.data_ptr()), batchSize,
                        reinterpret_cast<float const*>(kvCacheScale.data_ptr()),
#if SPEC_DEC
                        qSeqLen, reinterpret_cast<uint32_t const*>(qCuSeqLens.data_ptr()),
                        reinterpret_cast<MaskType const*>(mask.data_ptr()),
#endif
                        reinterpret_cast<uint32_t*>(semaphores.data_ptr()),
                        reinterpret_cast<void*>(scratch.data_ptr()), stream);
  }

Severity: medium

The if and else blocks contain identical arguments passed to two different functions (launchHopperF8MHAFlashInfer and launchMHAFlashInfer). This code duplication can be reduced to improve maintainability. Since both functions share the same signature, you can use a function pointer to select the appropriate kernel and then call it once with the common set of arguments.

  using mha_launcher_t = decltype(&launchMHAFlashInfer);
  mha_launcher_t launcher = run_fp8_mha ? &launchHopperF8MHAFlashInfer : &launchMHAFlashInfer;
  launcher(multiProcessorCount, nbKHeads, slidingWinSize, qScale,
           reinterpret_cast<OutputHead*>(output.data_ptr()),
#if LOW_PREC_OUTPUT
           reinterpret_cast<float const*>(rcpOutScale.data_ptr()),
#endif
           reinterpret_cast<InputHead const*>(q.data_ptr()), attentionSinksPtr,
           reinterpret_cast<GMemCacheHead*>(pool.data_ptr()),
           reinterpret_cast<KVCachePageIndex const*>(kvCachePageList.data_ptr()),
           maxSeqLen, reinterpret_cast<uint32_t const*>(seqLen.data_ptr()), batchSize,
           reinterpret_cast<float const*>(kvCacheScale.data_ptr()),
#if SPEC_DEC
           qSeqLen, reinterpret_cast<uint32_t const*>(qCuSeqLens.data_ptr()),
           reinterpret_cast<MaskType const*>(mask.data_ptr()),
#endif
           reinterpret_cast<uint32_t*>(semaphores.data_ptr()),
           reinterpret_cast<void*>(scratch.data_ptr()), stream);

gemini-code-assist (Contributor)

Summary of Changes

Hello @qsang-nv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the XQA (eXtended Query Attention) kernels by introducing support for FP8 Multi-Head Attention and FP8 Key-Value cache. These changes are designed to optimize performance and memory usage on NVIDIA Hopper (SM90+) GPUs through low-level CUDA programming, including asynchronous tensor memory access and matrix operations. The integration ensures that the system can efficiently handle lower precision data types, with comprehensive testing to maintain accuracy within acceptable bounds.

Highlights

  • FP8 Multi-Head Attention (MHA) Support: Introduced a new launchHopperF8MHAFlashInfer function and integrated it into the xqa_wrapper to enable FP8 MHA execution, controlled by a new run_fp8_mha boolean parameter.
  • FP8 KV Cache Implementation: Added support for FP8 Key-Value (KV) cache, including new CUtensorMap utilities for efficient memory access and configuration options in the AOT/JIT compilation system.
  • Low-Level CUDA Optimizations: Incorporated new CUDA kernel files (gmma.cuh, tma.h) that define asynchronous matrix multiply accumulate (GMMA) operations and tensor memory access (TMA) functions, leveraging Hopper architecture features for performance.
  • Numerical Stability Improvements: Adjusted the safeInitRowMax constant and its usage in applyMaskFromInput to enhance numerical stability, especially for large values, in attention calculations.
  • AOT/JIT Compilation and Testing: Updated the AOT and JIT compilation infrastructure to generate kernels for various FP8 configurations and expanded the test suite to thoroughly validate the new FP8 MHA and KV cache functionalities, including adjusted precision checks.

gemini-code-assist (bot) left a comment:


Code Review

This pull request introduces support for FP8 multi-head attention (MHA) and FP8 KV cache in the XQA kernels, targeting Hopper architecture for performance improvements. The changes include new low-level CUDA files (gmma.cuh, tma.h, tensorMap.cpp) with Hopper-specific WGMMA and TMA instructions, a new FP8 MHA kernel entry point, and updates to the AOT compilation scripts and Python wrappers to handle the new FP8 variants. The tests have also been updated to include FP8 configurations and use a more lenient assertion method to account for precision differences.

My review focuses on code maintainability and clarity. I've suggested refactoring a duplicated code block in the C++ wrapper to improve readability and proposed adding a comment in the Python tests to clarify a magic number used for data scaling. Overall, the changes are well-structured and the addition of FP8 support is a valuable performance enhancement.

Comment on lines 37 to 72
  if (run_fp8_mha) {
    launchHopperF8MHAFlashInfer(
        multiProcessorCount, nbKHeads, slidingWinSize, qScale,
        reinterpret_cast<OutputHead*>(output.data_ptr()),
#if LOW_PREC_OUTPUT
        reinterpret_cast<float const*>(rcpOutScale.data_ptr()),
#endif
        reinterpret_cast<InputHead const*>(q.data_ptr()), attentionSinksPtr,
        reinterpret_cast<GMemCacheHead*>(pool.data_ptr()),
        reinterpret_cast<KVCachePageIndex const*>(kvCachePageList.data_ptr()), maxSeqLen,
        reinterpret_cast<uint32_t const*>(seqLen.data_ptr()), batchSize,
        reinterpret_cast<float const*>(kvCacheScale.data_ptr()),
#if SPEC_DEC
        qSeqLen, reinterpret_cast<uint32_t const*>(qCuSeqLens.data_ptr()),
        reinterpret_cast<MaskType const*>(mask.data_ptr()),
#endif
        reinterpret_cast<uint32_t*>(semaphores.data_ptr()),
        reinterpret_cast<void*>(scratch.data_ptr()), stream);
  } else {
    launchMHAFlashInfer(multiProcessorCount, nbKHeads, slidingWinSize, qScale,
                        reinterpret_cast<OutputHead*>(output.data_ptr()),
#if LOW_PREC_OUTPUT
                        reinterpret_cast<float const*>(rcpOutScale.data_ptr()),
#endif
                        reinterpret_cast<InputHead const*>(q.data_ptr()), attentionSinksPtr,
                        reinterpret_cast<GMemCacheHead*>(pool.data_ptr()),
                        reinterpret_cast<KVCachePageIndex const*>(kvCachePageList.data_ptr()),
                        maxSeqLen, reinterpret_cast<uint32_t const*>(seqLen.data_ptr()), batchSize,
                        reinterpret_cast<float const*>(kvCacheScale.data_ptr()),
#if SPEC_DEC
                        qSeqLen, reinterpret_cast<uint32_t const*>(qCuSeqLens.data_ptr()),
                        reinterpret_cast<MaskType const*>(mask.data_ptr()),
#endif
                        reinterpret_cast<uint32_t*>(semaphores.data_ptr()),
                        reinterpret_cast<void*>(scratch.data_ptr()), stream);
  }

Severity: medium

There's a large block of duplicated code for launching the MHA kernels. The only difference between the if and else blocks is the function being called (launchHopperF8MHAFlashInfer vs. launchMHAFlashInfer). This could be refactored to improve maintainability and reduce redundancy.

Consider using a function pointer to select the kernel, and then make a single call. This would make the code cleaner and easier to manage if more arguments are added in the future.

For example:

void (*mha_func)(uint32_t, uint32_t, ...); // Using a function pointer type alias

if (run_fp8_mha) {
    mha_func = &launchHopperF8MHAFlashInfer;
} else {
    mha_func = &launchMHAFlashInfer;
}

mha_func(
    multiProcessorCount,
    nbKHeads,
    slidingWinSize,
    // ... other arguments
);

A complete version of the suggested refactor:

  using mha_func_t = void (*)(uint32_t, uint32_t, uint32_t, float, OutputHead*,
#if LOW_PREC_OUTPUT
                              float const*,
#endif
                              InputHead const*, float const*, GMemCacheHead*,
                              KVCachePageIndex const*, uint32_t, uint32_t const*, uint32_t,
                              float const* __restrict__,
#if SPEC_DEC
                              uint32_t, uint32_t const*, MaskType const*,
#endif
                              uint32_t*, void*, cudaStream_t);

  mha_func_t mha_func = run_fp8_mha ? &launchHopperF8MHAFlashInfer : &launchMHAFlashInfer;

  mha_func(multiProcessorCount, nbKHeads, slidingWinSize, qScale,
           reinterpret_cast<OutputHead*>(output.data_ptr()),
#if LOW_PREC_OUTPUT
           reinterpret_cast<float const*>(rcpOutScale.data_ptr()),
#endif
           reinterpret_cast<InputHead const*>(q.data_ptr()), attentionSinksPtr,
           reinterpret_cast<GMemCacheHead*>(pool.data_ptr()),
           reinterpret_cast<KVCachePageIndex const*>(kvCachePageList.data_ptr()), maxSeqLen,
           reinterpret_cast<uint32_t const*>(seqLen.data_ptr()), batchSize,
           reinterpret_cast<float const*>(kvCacheScale.data_ptr()),
#if SPEC_DEC
           qSeqLen, reinterpret_cast<uint32_t const*>(qCuSeqLens.data_ptr()),
           reinterpret_cast<MaskType const*>(mask.data_ptr()),
#endif
           reinterpret_cast<uint32_t*>(semaphores.data_ptr()),
           reinterpret_cast<void*>(scratch.data_ptr()), stream);

Comment on lines 230 to 233
    if fp8_kv_cache:
        cache_heads /= 4.0

Severity: medium

The value 4.0 is used to scale down the cache_heads tensor when fp8_kv_cache is enabled. This appears to be a magic number. To improve code clarity and maintainability, please add a comment explaining the rationale for this specific scaling factor. For example, explaining that it's to prevent overflow and how 4.0 was determined would be very helpful for future readers.

Suggested change:

     if fp8_kv_cache:
-        cache_heads /= 4.0
+        # Scale down the cache heads to keep values within the representable range of FP8
+        # and prevent overflow during computation. The factor 4.0 is chosen empirically.
+        cache_heads /= 4.0
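
For background on why the test scales values before the FP8 conversion, here is a small standalone sketch (my own illustration, not the PR's code), assuming the KV cache uses PyTorch's torch.float8_e4m3fn type; only the 4.0 factor comes from the test itself:

  import torch

  # float8 e4m3 has a much narrower dynamic range than fp16/bf16.
  print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0
  print(torch.finfo(torch.bfloat16).max)       # ~3.39e38

  # Scaling cache values down (here by the same 4.0 used in the test) keeps
  # them comfortably below the FP8 maximum before the cast; the scale is
  # multiplied back in after dequantization.
  cache = torch.randn(4, 8) * 200.0        # stand-in values, not real cache data
  cache_fp8 = (cache / 4.0).to(torch.float8_e4m3fn)
  recovered = cache_fp8.to(torch.float32) * 4.0
  print((recovered - cache).abs().max())   # quantization error, no overflow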

gemini-code-assist (bot) left a comment:

Code Review

This pull request introduces support for FP8 multi-head attention (MHA) and FP8 KV cache for Cross-Query Attention (XQA), targeting NVIDIA's Hopper architecture. This is a significant feature addition that leverages low-level hardware capabilities like TMA and GMMA for performance. The changes include new CUDA files for these Hopper-specific features, along with updates to the Python build system and tests to accommodate the new configurations. The review identified a critical bug in the new TMA storeAsync implementation and a high-severity correctness issue related to the handling of masked values in the softmax computation.

Comment on lines +220 to +243
: "memory");
} else if constexpr (nbDims == 5) {
asm volatile(
"cp.async.bulk.tensor.2d.global.shared::cta.bulk_group.tile [%0, {%1, %2, %3, %4, %5}], "
"[%6];\n"
:
: "l"(reinterpret_cast<uint64_t>(&tensorMap)), "r"(offset[0]), "r"(offset[1]),
"r"(offset[2]), "r"(offset[3]), "r"(offset[4]), "l"(__cvta_generic_to_shared(src))
: "memory");
} else {
static_assert(nbDims >= 1 && nbDims <= 5);
}
}

__device__ inline void setTensorMapGlbAddr(CUtensorMap& tensorMap, void* ptr) {
asm volatile(
"tensormap.replace.tile.global_address.global.b1024.b64 [%0], %1;\n" ::"l"(&tensorMap),
"l"(ptr)
: "memory");
}

__device__ inline void commitGroup() {
asm volatile("cp.async.bulk.commit_group;\n" : : : "memory");
}

Severity: critical

There appears to be a copy-paste error in the storeAsync template function. For nbDims of 3, 4, and 5, the inline assembly instruction is cp.async.bulk.tensor.2d..., but it should be cp.async.bulk.tensor.3d..., cp.async.bulk.tensor.4d..., and cp.async.bulk.tensor.5d... respectively. This will lead to incorrect memory access patterns and likely data corruption for higher-dimensional tensors.

    else if constexpr (nbDims == 3)
    {
        asm volatile("cp.async.bulk.tensor.3d.global.shared::cta.bulk_group.tile [%0, {%1, %2, %3}], [%4];\n"
                     :
                     : "l"(reinterpret_cast<uint64_t>(&tensorMap)), "r"(offset[0]), "r"(offset[1]), "r"(offset[2]),
                     "l"(__cvta_generic_to_shared(src))
                     : "memory");
    }
    else if constexpr (nbDims == 4)
    {
        asm volatile("cp.async.bulk.tensor.4d.global.shared::cta.bulk_group.tile [%0, {%1, %2, %3, %4}], [%5];\n"
                     :
                     : "l"(reinterpret_cast<uint64_t>(&tensorMap)), "r"(offset[0]), "r"(offset[1]), "r"(offset[2]),
                     "r"(offset[3]), "l"(__cvta_generic_to_shared(src))
                     : "memory");
    }
    else if constexpr (nbDims == 5)
    {
        asm volatile("cp.async.bulk.tensor.5d.global.shared::cta.bulk_group.tile [%0, {%1, %2, %3, %4, %5}], [%6];\n"
                     :
                     : "l"(reinterpret_cast<uint64_t>(&tensorMap)), "r"(offset[0]), "r"(offset[1]), "r"(offset[2]),
                     "r"(offset[3]), "r"(offset[4]), "l"(__cvta_generic_to_shared(src))
                     : "memory");
    }

                       ? true
                       : packedMask & (1u << ((col + actualQSeqLen - nbValidCols) - maskPosStart));
-        acc(m, n)(i, j) = maskFlag && col < nbValidCols ? acc(m, n)(i, j) : -INFINITY;
+        acc(m, n)(i, j) = maskFlag && col < nbValidCols ? acc(m, n)(i, j) : safeInitRowMax;

Severity: high

Using safeInitRowMax for masked elements can lead to incorrect results. When an entire row/sequence is masked, all attention scores become safeInitRowMax. In the softmax computation, maxVal also becomes safeInitRowMax, and exp(score - maxVal) evaluates to 1 for all masked positions. This results in a uniform attention distribution over masked tokens, and the output becomes the average of values in V, instead of zero.

A correct implementation should ensure that the softmax output for masked tokens is zero. If the entire row is masked, the final output should also be zero. This might require changes in the softmax function to handle safeInitRowMax specially, and in the final normalization step to handle a row sum of zero.
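
To make the concern concrete, a small standalone sketch (not code from the PR) showing what happens when every score in a row holds the same finite sentinel: softmax degenerates to a uniform distribution, so the row's output is the mean of V rather than zero.

  import torch

  nb_cols, head_dim = 8, 4
  safe_init_row_max = -1e5  # finite sentinel standing in for masked scores

  scores = torch.full((nb_cols,), safe_init_row_max)  # fully masked row
  v = torch.randn(nb_cols, head_dim)

  row_max = scores.max()
  weights = torch.exp(scores - row_max)  # exp(0) == 1 at every position
  weights = weights / weights.sum()      # uniform 1/nb_cols distribution
  out = weights @ v                      # average of V rows, not zero
  print(weights)                             # eight values of 0.125
  print(torch.allclose(out, v.mean(dim=0)))  # True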

qsang-nv requested a review from yzh119 on September 25, 2025 at 08:52
@@ -16,8 +16,8 @@

 #include "pytorch_extension_utils.h"

-void xqa_wrapper(int64_t multiProcessorCount, int64_t nbKHeads, int64_t slidingWinSize,
-                 double qScale, at::Tensor output,
+void xqa_wrapper(bool run_fp8_mha, int64_t multiProcessorCount, int64_t nbKHeads,
Contributor:

Instead of making this a flag, could we pass a dtype?

Same for the other places where we pass:

  • the type of the input (only bf16 and fp16 supported I think)
  • the type of the kv-cache (fp8 or bf16)
  • the type in which we perform arithmetic (the same type as the kv-cache I think?)

qsang-nv (Collaborator, Author):

The dtype is now passed in flashinfer/flashinfer/xqa.py.


inline constexpr float log2e = 1.4426950408889634; // std::log2(M_E)
inline constexpr float safeInitRowMax = -1e+30F;
// we used an optimization where exp(x-rowMax) is computed as:
Contributor:

That's interesting: what were the symptoms of the instability? Accuracy loss?

qsang-nv (Collaborator, Author):

This is copied from NVIDIA/TensorRT-LLM@c1aa7f3; you may want to ask the original author, as I am not sure about this myself.

@@ -354,4 +364,21 @@ def cache_head_at(
     kernel_output = output[req][b][
         idx_k_head * head_grp_size : (idx_k_head + 1) * head_grp_size
     ].to(torch.float32)
-    assert torch.allclose(ref_output, kernel_output, atol=0.01, rtol=0.01)
+    if fp8_kv_cache or run_fp8_mha:
+        atol = 0.05
Contributor:

How did you tune this tolerance? Can it be smaller?

qsang-nv (Collaborator, Author):

I increased it from 0.01 to 0.05 in steps of 0.01, and from my tests it can't be made smaller.

    else:
        flag_sliding_window = ["-DSLIDING_WINDOW=0"]

    if sm_version == 100:
Member:

Is it possible to add SM103 support by targeting SM100f instead of SM100a?

And similarly, can we add SM121 support by targeting SM120f instead of SM120a?

qsang-nv (Collaborator, Author):

What's the difference between those archs? I mean SM103/SM100f/SM100a, and SM121/SM120f/SM120a.

Member:

The "a" means "arch-specific" and the "f" means "family".

SM100 and SM103 are in the "SM100f family".

SM100a (SM100 arch-specific) will only run on SM100 devices.

But I believe SM103 devices have a superset of the SM100 features, and therefore if you target SM100f instead of SM100a during compilation, your cubin will be able to run on SM103 as well, without any loss of optimization on either device. So I think it's strictly better than targeting SM103a.

SM121 and SM120 have a similar story: it's better to target SM120f as a compilation target, yielding a cubin that will run on both SM120 and SM121 devices without any compromise to performance.

See this documentation for details: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#family-specific-features

@aleozlx can you confirm my understanding?

aleozlx (Collaborator) commented Oct 16, 2025:

Yes, sm_100f is known as family-specific or family-conditional. This is important for enhancing device compatibility in an out-of-the-box fashion.

I'd like to point out a few things, though, based on my experience:

  1. With strictly JIT compilation, where the premise is that fewer devices are in the compatibility question at runtime, the arch conditionals may be the safest way to target the instruction supersets (when the target is available to compile with the toolkit at the time of implementation).
  2. Family conditionals are important for the compatibility story (indeed aligning with your understanding conceptually), but they are not without inherent engineering complexity. From an engineer's perspective, the story at the levels beneath is slightly more complicated; I'll spare the details but leave it as a reliability intuition (in the context of JIT).

However, I want to bring CompilationContext.get_nvcc_flags_list() up for consideration, so that we don't hard-code guidance either way but keep it abstracted and adjustable if the situation changes. Briefly, for each op/backend, if you whitelist your supported targets, this function serves as the mapping that provides the recommended flags.

We can put in SM103 support (likely fine for attention), supposing our CI/CD will catch the issue if not.

sricketts (Member) commented Oct 17, 2025:

Makes sense, let me clarify my guidance then:

  1. Since SM120 and SM121 are identical architectures, naively I would think that supporting both is only marginally harder than supporting one.
  2. The story seems even a little harder for SM100 and SM103, because they are not identical, but if you don't care about SM103-specific features, I would think the marginal effort of supporting SM103 shouldn't be massive.
  3. Therefore the default design for any solution for SM100 and SM120 should at least try to include SM103 and SM121 support, or should at least be designed with SM103 and SM121 in mind, even if some of the details are left as a future TODO.
  4. What compilation targets you use, and how you architect the code to query for those compilation targets, is an engineering implementation detail, and I probably shouldn't be opinionated about that. :)

The problem here is that the PR doesn't address (3), maybe because @qsang-nv didn't know about (1) and (2). I'm only proposing that we re-think the design here with the above in mind.

@aleozlx @qsang-nv do you agree?

qsang-nv (Collaborator, Author):

Thanks for the detailed explanation! I've added support for sm100f and sm121a.
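
For readers following the arch vs. family discussion, here is a rough sketch of the kind of flag mapping being proposed. This is my own illustration, not the PR's code: the CompilationContext.get_nvcc_flags_list() abstraction mentioned above is the real mechanism, and the family-conditional target spellings below are assumptions that require a sufficiently new CUDA toolkit.

  # Hypothetical mapping from SM version to NVCC gencode flags; the exact
  # "compute_*f"/"sm_*f" spellings are an assumption, check your toolkit docs.
  # "a" (arch-specific) cubins run only on that exact SM, while "f"
  # (family-conditional) cubins run on every member of the family,
  # e.g. an sm_100f cubin also runs on SM103 devices.
  SM_NVCC_FLAGS = {
      90: ["-gencode=arch=compute_90a,code=sm_90a"],
      100: ["-gencode=arch=compute_100f,code=sm_100f"],  # covers SM100 and SM103
      120: ["-gencode=arch=compute_120f,code=sm_120f"],  # covers SM120 and SM121
  }

  def nvcc_flags_for(sm_version: int) -> list[str]:
      if sm_version not in SM_NVCC_FLAGS:
          raise ValueError(f"unsupported SM version: {sm_version}")
      return SM_NVCC_FLAGS[sm_version]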

coderabbitai (bot) commented Oct 17, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds FP8-capable MHA dispatch and selects between Hopper FP8 and standard MHA via a new run_sm90_fp8_mha flag; introduces conditional paged KV-cache layout (separate K/V cache vs pool) across CUDA headers, kernels, and Python build/runtime layers; adds tensor-map and TMA async APIs and extends SM support (SM100/SM120).

Changes

Cohort / File(s) Summary
C++ binding / entry
csrc/flashinfer_xqa_binding.cu
Added leading bool run_sm90_fp8_mha parameter to xqa_wrapper binding; changed attentionSinks to tvm::ffi::Optional<TensorView>; reordered and guarded KV-cache parameters with #if PAGED_KV_CACHE_LAYOUT (either kCacheVLLM,vCacheVLLM or pool).
XQA wrapper / dispatcher
csrc/xqa/xqa_wrapper.cu
Added run_sm90_fp8_mha param and mha_func selector to call either launchHopperF8MHAFlashInfer or launchMHAFlashInfer; conditional passing of kCacheVLLM/vCacheVLLM vs pool; adjusted argument ordering for subsequent params.
MHA interfaces & kernels
csrc/xqa/mha.h, csrc/xqa/mha.cu
Updated launchMHAFlashInfer / launchHopperF8MHA signatures to accept conditional KV-cache params (kCacheVLLM,vCacheVLLM when PAGED_KV_CACHE_LAYOUT==1, else pool); added launchHopperF8MHAFlashInfer declaration; kernel launch call sites updated to pass batchSize and kvCacheScale consistently; arch guards extended to include __CUDA_ARCH__ == 1000.
Matrix MMA helpers
csrc/xqa/gmma.cuh
New public header: gmma namespace with SwizzleMode, bitfield-packed MatDesc, helper constructors (makeMatDesc, addAddr), inst constants, and templated mma_async_* declarations plus fence/commit/wait primitives; includes gmma_impl.cuh.
TensorMap utilities
csrc/xqa/tensorMap.h, csrc/xqa/tensorMap.cpp
New header and implementation to build CUtensorMap for contiguous and paged KV cache layouts, getElemBytes helper, swizzle/part selection, and error checks; layout differs based on PAGED_KV_CACHE_LAYOUT.
TMA async API
csrc/xqa/tma.h
New CUDA header exposing tma namespace: StateSpace enum, (conditional) CUtensorMap type, asynchronous load/store/prefetch primitives, tensor-map address setter, commitGroup/waitGroup, and prefetch helpers implemented with inline assembly.
Utilities
csrc/xqa/utils.cuh
Changed safeInitRowMax constant to -1e+5F with explanatory comments about numerical stability; extended kMAX_SMEM_SIZE arch conditional to include __CUDA_ARCH__ == 1000.
Python JIT
flashinfer/jit/xqa.py
gen_xqa_module signature updated to input_dtype, kv_cache_dtype, page_size, head_dim, head_group_ratio, use_sliding_window, sm_version; added kv-cache dtype handling, sm-specific NVCC flags, added mha_sm90.cu and tensorMap.cpp sources, -lcuda ldflag and -DPAGED_KV_CACHE_LAYOUT=1.
Python AOT / build
flashinfer/aot.py
gen_xqa renamed params (fp16_input_, fp8_kv_cache_), iterates sm_versions over available SMs (90/100/120), propagates sm_version and kv-cache dtype into specs.
Python public API & wiring
flashinfer/xqa.py
get_xqa_module and xqa signatures updated to accept input_dtype, kv_cache_dtype, page_size, head_dim, head_group_ratio, sm_version; public xqa now takes run_sm90_fp8_mha, k_cache, v_cache, page_table, seq_lens, workspace_buffer and forwards new args to module.
Tests
tests/attention/test_xqa.py
Test updated to VLLM page layout indexing, separate K/V caches, fp16_input and fp8_kv_cache flags, new SM gating (90/100/120), page list population changes, optional FP8 scaling, and relaxed tolerances for FP8 tests with pass-ratio checks.
Docs
docs/api/attention.rst
Added documentation entry for flashinfer.xqa module and xqa symbol.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CppBinding as flashinfer_xqa_binding
    participant XQAWrapper as xqa_wrapper (C++)
    participant MHA_Dispatcher as MHA Launcher
    participant HopperFP8 as HopperF8 MHA Kernel
    participant StdMHA as Standard MHA Kernel

    User->>CppBinding: call xqa(...) with run_sm90_fp8_mha
    CppBinding->>XQAWrapper: forward args (run_sm90_fp8_mha, caches, page_table, ...)
    XQAWrapper->>MHA_Dispatcher: select mha_func based on run_sm90_fp8_mha

    alt run_sm90_fp8_mha == true
        MHA_Dispatcher->>HopperFP8: launchHopperF8MHAFlashInfer(..., kCacheVLLM, vCacheVLLM, ...)
        HopperFP8-->>XQAWrapper: results
    else
        MHA_Dispatcher->>StdMHA: launchMHAFlashInfer(..., pool or k/v cache per layout, ...)
        StdMHA-->>XQAWrapper: results
    end

    XQAWrapper-->>CppBinding: return outputs
    CppBinding-->>User: deliver attention outputs
sequenceDiagram
    participant PythonClient as flashinfer.xqa
    participant ModuleGen as gen_xqa_module
    participant NVCC as Compiler
    participant KVLayout as KV Cache Layout Logic

    PythonClient->>ModuleGen: request module (input_dtype, kv_cache_dtype, sm_version)
    ModuleGen->>KVLayout: include -DPAGED_KV_CACHE_LAYOUT=1 (choose k/v cache layout)
    ModuleGen->>NVCC: compile with sm-specific flags and extra sources (tensorMap.cpp, mha_sm90.cu)
    NVCC-->>ModuleGen: compiled XQA module
    ModuleGen-->>PythonClient: module handle

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • yzh119

Poem

🐇 I hopped through pages, K and V in tow,

FP8 sparks and swizzles set the flow,
Async maps hum, barriers clap in time,
SMs stretch wider, kernels learn new rhymes,
A bunny's cheer: fast attention, watch it go!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 18.18%, which is insufficient; the required threshold is 80.00%. Resolution: you can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check: ✅ Passed. The PR title "add xqa fp8 mha and fp8 kv cache" accurately and specifically describes the primary changes in the pull request. The changeset introduces FP8 multi-head attention (MHA) capabilities and FP8 key/value (KV) cache support across multiple source files, Python modules, and tests. The title is concise, clear, and uses specific terminology rather than vague placeholders, making it immediately understandable to reviewers scanning the commit history.
  • Description Check: ✅ Passed. The pull request description addresses the core required sections of the template. The 📌 Description section is present, stating "Add xqa fp8 mha and fp8 kv cache," which conveys the main change. More importantly, all critical checklist items under ✅ Pre-commit Checks and 🧪 Tests are marked as completed, indicating the author has performed the necessary pre-submission validation. While the description is minimal (a single sentence) and the Related Issues and Reviewer Notes sections are not filled out, these gaps are acceptable as non-critical sections per the template structure.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ef83d1 and 7372dd6.

📒 Files selected for processing (7)
  • csrc/flashinfer_xqa_binding.cu (1 hunks)
  • csrc/xqa/xqa_wrapper.cu (2 hunks)
  • docs/api/attention.rst (1 hunks)
  • flashinfer/aot.py (2 hunks)
  • flashinfer/jit/xqa.py (1 hunks)
  • flashinfer/xqa.py (3 hunks)
  • tests/attention/test_xqa.py (8 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/api/attention.rst
🧰 Additional context used
🧬 Code graph analysis (6)
csrc/flashinfer_xqa_binding.cu (1)
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu (4)
  • output (230-396)
  • output (230-238)
  • output (398-566)
  • output (398-409)
flashinfer/aot.py (2)
flashinfer/jit/core.py (1)
  • JitSpec (181-280)
flashinfer/jit/xqa.py (1)
  • gen_xqa_module (41-113)
flashinfer/xqa.py (4)
flashinfer/utils.py (6)
  • get_device_sm_count (589-590)
  • register_custom_op (266-275)
  • register_custom_op (285-304)
  • register_fake_op (277-281)
  • register_fake_op (306-311)
  • get_compute_capability (245-248)
flashinfer/jit/xqa.py (1)
  • gen_xqa_module (41-113)
csrc/flashinfer_xqa_binding.cu (1)
  • xqa_wrapper (19-35)
csrc/xqa/xqa_wrapper.cu (2)
  • xqa_wrapper (22-66)
  • xqa_wrapper (22-38)
tests/attention/test_xqa.py (1)
flashinfer/utils.py (1)
  • get_compute_capability (245-248)
csrc/xqa/xqa_wrapper.cu (2)
csrc/xqa/mha_sm90.cu (4)
  • launchHopperF8MHAFlashInfer (3168-3275)
  • launchHopperF8MHAFlashInfer (3168-3185)
  • scratch (506-513)
  • scratch (506-506)
csrc/xqa/mha.cu (2)
  • launchMHAFlashInfer (2657-2749)
  • launchMHAFlashInfer (2657-2674)
flashinfer/jit/xqa.py (1)
flashinfer/jit/core.py (2)
  • JitSpec (181-280)
  • gen_jit_spec (283-347)
🪛 Ruff (0.14.0)
flashinfer/xqa.py

234-234: Avoid specifying long messages outside the exception class

(TRY003)

tests/attention/test_xqa.py

265-265: Unused function argument: beam_width

(ARG001)

flashinfer/jit/xqa.py

55-57: Avoid specifying long messages outside the exception class

(TRY003)


67-69: Avoid specifying long messages outside the exception class

(TRY003)


73-75: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (26)
flashinfer/jit/xqa.py (6)

18-27: LGTM: Import additions support new dtype-aware configuration.

The imports correctly bring in torch for dtype handling, filename_safe_dtype_map for safe naming, and SM-specific compiler flags to support multiple GPU architectures.


29-38: LGTM: NVCC flags correctly configure paged KV cache layout.

The flags properly enable the new paged KV cache layout (PAGED_KV_CACHE_LAYOUT=1) and disable input KV usage (USE_INPUT_KV=0), aligning with the architectural changes across the PR.


50-64: LGTM: Dtype validation is thorough and correct.

The dtype validation correctly restricts input to float16/bfloat16 and handles KV cache dtypes (fp8, int8, or matching input dtype). The flag generation properly maps dtypes to compiler defines.

Note: Static analysis suggests extracting long error messages (TRY003), but for configuration validation, inline descriptive messages are acceptable and aid debugging.


66-78: LGTM: Configuration validation ensures tensor core compatibility.

The validation correctly enforces constraints for page size (power-of-2 values from 16-128) and head dimension (16-byte aligned, max 256), which are required for efficient tensor core operations.


85-92: Consider SM103 support and family-specific compilation targets.

The current implementation handles SM100 (using family-specific sm100f_nvcc_flags), SM120, and SM121, with a fallback to SM90. However, based on past review discussions, there are open questions:

  1. SM103 support: Should SM103 be explicitly supported? The past review thread suggests SM103 devices can run SM100f family-specific code.
  2. SM120/SM121 family flags: The past review suggested that SM120f (family-specific) could run on both SM120 and SM121 devices. Currently, you're using arch-specific flags for both.

The decision on whether to add explicit SM103 handling or use family vs arch-specific flags depends on your target device matrix and compatibility requirements.

Based on past review discussions, would you like to:

  • Add explicit SM103 handling (mapping to sm100f_nvcc_flags)?
  • Consider using sm120f (family) instead of sm120a/sm121a (arch-specific) for broader compatibility?

Based on learnings


94-112: LGTM: Module generation correctly reflects new configuration parameters.

The module naming scheme properly incorporates all configuration dimensions (input dtype, KV cache dtype, page size, head dim, head group ratio, sliding window, SM version), ensuring unique compilation per variant. The addition of mha_sm90.cu, tensorMap.cpp, and the CUDA Driver API linkage (-lcuda) correctly support the new FP8 MHA and tensor map functionality.

csrc/xqa/xqa_wrapper.cu (3)

22-32: LGTM: Function signature correctly supports dual KV cache layouts.

The conditional compilation properly handles two KV cache layouts:

  • Layout 1: Separate K and V cache tensors (kCacheVLLM, vCacheVLLM)
  • Layout 0: Unified pool tensor

The run_sm90_fp8_mha flag enables dynamic selection between FP8 Hopper MHA and standard MHA kernels.


43-43: LGTM: Function pointer selection eliminates code duplication.

This implementation addresses the past review feedback by using a function pointer to select the appropriate MHA launcher, eliminating the need for duplicated call sites.


45-65: LGTM: Unified launcher call is clean and correct.

The single call site to mha_func properly handles all conditional compilation paths (LOW_PREC_OUTPUT, PAGED_KV_CACHE_LAYOUT, SPEC_DEC) with appropriate type casting and parameter ordering. This maintains code clarity while supporting multiple configurations.

flashinfer/aot.py (5)

358-380: LGTM: Function signature expanded to support FP8 KV cache and multiple SM versions.

The parameter renaming (use_fp16_fp16_input_) improves clarity, and the addition of fp8_kv_cache_ and SM version flags (has_sm100, has_sm120) enables proper per-architecture module generation. The early return when no supported SM versions are available is a good optimization.


381-397: LGTM: Configuration iteration properly expands across all dimensions.

The product iteration correctly includes the new fp16_input, fp8_kv_cache, and sm_version parameters, ensuring modules are generated for all valid configuration combinations.


404-409: LGTM: KV cache dtype derivation follows sensible precedence.

The logic correctly prioritizes FP8 when requested, then matches the KV cache dtype to the input dtype (fp16 or bf16), which is a reasonable default strategy for AOT compilation.


410-418: LGTM: gen_xqa_module call correctly passes all configuration parameters.

The call properly maps all configuration dimensions (input_dtype, kv_cache_dtype, page_size, head_dim, head_group_ratio, use_sliding_window, sm_version) to the updated gen_xqa_module signature.


532-552: LGTM: Caller correctly updated with new parameter naming and FP8 support.

The gen_all_modules function properly renames xqa_use_fp16_ to xqa_fp16_input_, adds xqa_fp8_kv_cache_, and passes the SM version flags (has_sm100, has_sm120) to gen_xqa.

tests/attention/test_xqa.py (6)

52-58: LGTM: VLLM layout indexing correctly implemented.

The updated indexing calculation properly maps to the VLLM paged KV cache layout ([page_idx][token_in_page][nb_heads][head_dim]), which aligns with the PAGED_KV_CACHE_LAYOUT=1 configuration used throughout the PR.


152-162: LGTM: Test parameters expanded to cover FP8 and multiple SM versions.

The test fixture correctly:

  • Expands compute capability checks to SM90, SM100, and SM120
  • Adds fp16_input and fp8_kv_cache parameters to test dtype combinations
  • Narrows nb_k_heads to [2, 4] (likely for test performance optimization)

215-250: LGTM: Cache initialization correctly handles separate K/V caches and FP8 scaling.

The test properly:

  • Allocates separate cache_k_heads and cache_v_heads tensors for the VLLM layout
  • Applies FP8 scaling (division by 4.0) with a clear explanatory comment addressing past review feedback
  • Initializes page list sequentially before shuffling

The FP8 scaling comment properly explains the rationale for the empirically-determined factor of 4.0.


257-279: LGTM: cache_head_at function correctly handles separate K/V caches.

The function properly:

  • Accepts separate cache_k_heads and cache_v_heads tensors
  • Implements VLLM layout indexing consistently with the CacheSeq class
  • Returns from the appropriate cache based on the is_k flag

Note: Static analysis flags beam_width as unused. This parameter appears to be maintained for interface consistency even though beam_width=1 is the only supported value in the current implementation.


314-330: LGTM: Kernel call correctly updated for separate K/V caches and FP8 support.

The XQA kernel invocation properly:

  • Passes separate cache_k_heads and cache_v_heads with conditional FP8 conversion
  • Uses the updated parameter order (page_table, seq_lens, etc.)
  • Maintains all configuration parameters (sinks, q_scale, kv_scale, sliding window, sm_count)

370-389: LGTM: Tolerance checking is appropriately relaxed for FP8 with robust pass ratio validation.

The updated validation:

  • Uses higher tolerance (0.05 atol/rtol) for FP8 KV cache vs standard precision (0.01), which is appropriate given FP8's reduced precision
  • Implements a pass ratio approach (99% of elements must meet tolerance) instead of requiring perfect element-wise matching, which is more robust for approximate computations
  • Addresses past review feedback regarding tolerance tuning

This approach properly balances test strictness with the inherent approximation errors of FP8 arithmetic.
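
As a concrete illustration of the pass-ratio style of check described above, a minimal standalone sketch (not the actual test code; the 99% threshold and tolerance values are taken from this summary):

  import torch

  def assert_mostly_close(ref, out, atol, rtol, min_pass_ratio=0.99):
      # Element-wise tolerance test, then require that at least
      # min_pass_ratio of the elements satisfy it.
      close = torch.isclose(out, ref, atol=atol, rtol=rtol)
      pass_ratio = close.float().mean().item()
      assert pass_ratio >= min_pass_ratio, f"only {pass_ratio:.2%} within tolerance"

  ref = torch.randn(1024)
  out = ref + torch.randn(1024) * 1e-3
  assert_mostly_close(ref, out, atol=0.05, rtol=0.05)  # relaxed FP8-style bounds
  assert_mostly_close(ref, out, atol=0.01, rtol=0.01)  # tighter fp16/bf16-style bounds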

csrc/flashinfer_xqa_binding.cu (1)

19-35: LGTM: Binding signature correctly mirrors wrapper function.

The TVM FFI binding properly declares the xqa_wrapper function with all updated parameters:

  • run_sm90_fp8_mha flag for kernel selection
  • Optional attention sinks
  • Conditional KV cache layout parameters (kCacheVLLM/vCacheVLLM vs pool)

This maintains consistency across the C++/Python boundary.

flashinfer/xqa.py (5)

33-50: LGTM: Module caching and parameterization enable efficient per-configuration JIT.

The @functools.cache decorator appropriately memoizes compiled modules per configuration (input dtype, KV cache dtype, page size, head dim, head group ratio, sliding window, SM version), avoiding redundant compilation while supporting all configuration combinations.


52-121: LGTM: Custom op registration correctly parameterized for torch.compile support.

The registration properly:

  • Generates unique names per configuration using dtype-safe identifiers
  • Declares mutation of output and workspace_buffer for graph analysis
  • Provides both real and fake implementations for torch.compile compatibility
  • Updates parameter names for clarity (workspace_buffer vs scratch)

141-200: LGTM: Comprehensive documentation clarifies API usage and constraints.

The docstring thoroughly documents:

  • All parameter shapes, dtypes, and constraints
  • The paged KV cache layout (VLLM-style)
  • Automatic parameter inference from tensor shapes
  • Optional parameters and their defaults

This will greatly aid users in understanding and correctly using the XQA API.


202-238: LGTM: Parameter inference and capability detection are well-designed.

The implementation properly:

  • Infers runtime parameters (sm_count, batch_size, head dimensions) from tensor shapes and device properties
  • Validates K and V cache dtype consistency (line 223) - critical for correctness
  • Restricts FP8 MHA to SM90 hardware (line 225-231) where it's supported
  • Detects and validates SM version (9, 10, 12 for SM90/SM100/SM120)
  • Provides defaults for optional parameters (kv_scale, sm_count)

This design minimizes API surface while maintaining flexibility.


240-267: LGTM: Module loading and kernel invocation correctly wired.

The implementation properly:

  • Retrieves the cached module with the full configuration tuple (dtype, page size, head dims, sliding window, SM version)
  • Passes all required parameters in the correct order to the underlying kernel
  • Uses inferred values (sm_count, max_seq_len, batch_size, etc.) derived earlier

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai (bot) left a comment:

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/attention/test_xqa.py (1)

28-33: Avoid GPU property access at import time.

Accessing torch.cuda.get_device_properties(0) during import can break test discovery on CPU/multi-device envs. Move it inside the test after skip checks.

-props = torch.cuda.get_device_properties(0)
-sm_count = props.multi_processor_count
+sm_count = None  # set inside test to avoid import-time CUDA queries
♻️ Duplicate comments (4)
csrc/xqa/tma.h (1)

208-229: Bug: 3D/4D/5D storeAsync use 2D opcode (will corrupt data).

The cp.async store paths for nbDims 3–5 incorrectly use tensor.2d. Must be tensor.3d/4d/5d.

Apply this diff:

@@
   } else if constexpr (nbDims == 3) {
-    asm volatile(
-        "cp.async.bulk.tensor.2d.global.shared::cta.bulk_group.tile [%0, {%1, %2, %3}], [%4];\n"
-        :
-        : "l"(reinterpret_cast<uint64_t>(&tensorMap)), "r"(offset[0]), "r"(offset[1]),
-          "r"(offset[2]), "l"(__cvta_generic_to_shared(src))
-        : "memory");
+    asm volatile(
+        "cp.async.bulk.tensor.3d.global.shared::cta.bulk_group.tile [%0, {%1, %2, %3}], [%4];\n"
+        :
+        : "l"(reinterpret_cast<uint64_t>(&tensorMap)), "r"(offset[0]), "r"(offset[1]),
+          "r"(offset[2]), "l"(__cvta_generic_to_shared(src))
+        : "memory");
   } else if constexpr (nbDims == 4) {
-    asm volatile(
-        "cp.async.bulk.tensor.2d.global.shared::cta.bulk_group.tile [%0, {%1, %2, %3, %4}], [%5];\n"
-        :
-        : "l"(reinterpret_cast<uint64_t>(&tensorMap)), "r"(offset[0]), "r"(offset[1]),
-          "r"(offset[2]), "r"(offset[3]), "l"(__cvta_generic_to_shared(src))
-        : "memory");
+    asm volatile(
+        "cp.async.bulk.tensor.4d.global.shared::cta.bulk_group.tile [%0, {%1, %2, %3, %4}], [%5];\n"
+        :
+        : "l"(reinterpret_cast<uint64_t>(&tensorMap)), "r"(offset[0]), "r"(offset[1]),
+          "r"(offset[2]), "r"(offset[3]), "l"(__cvta_generic_to_shared(src))
+        : "memory");
   } else if constexpr (nbDims == 5) {
-    asm volatile(
-        "cp.async.bulk.tensor.2d.global.shared::cta.bulk_group.tile [%0, {%1, %2, %3, %4, %5}], "
-        "[%6];\n"
-        :
-        : "l"(reinterpret_cast<uint64_t>(&tensorMap)), "r"(offset[0]), "r"(offset[1]),
-          "r"(offset[2]), "r"(offset[3]), "r"(offset[4]), "l"(__cvta_generic_to_shared(src))
-        : "memory");
+    asm volatile(
+        "cp.async.bulk.tensor.5d.global.shared::cta.bulk_group.tile [%0, {%1, %2, %3, %4, %5}], [%6];\n"
+        :
+        : "l"(reinterpret_cast<uint64_t>(&tensorMap)), "r"(offset[0]), "r"(offset[1]),
+          "r"(offset[2]), "r"(offset[3]), "r"(offset[4]), "l"(__cvta_generic_to_shared(src))
+        : "memory");
   }
csrc/xqa/mha.cu (1)

479-479: Critical: Masked position initialization may cause incorrect attention output.

Using safeInitRowMax for masked elements can lead to incorrect results. When an entire row is masked, all scores become safeInitRowMax, and in softmax computation exp(score - maxVal) evaluates to 1 for all positions, producing a uniform distribution over masked tokens instead of zero output.

As noted in the previous review, the softmax function should handle safeInitRowMax specially to ensure masked tokens contribute zero to the output, or alternatively masked positions should use a different sentinel value that results in zero after softmax.
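
A small torch illustration of the failure mode (independent of the kernel code): when every score in a row equals the same sentinel, softmax produces uniform weights rather than zeros, so the masked V rows get averaged into the output.

import torch

safe_init_row_max = -1e5
scores = torch.full((1, 4), safe_init_row_max)  # a fully masked row
weights = torch.softmax(scores, dim=-1)
print(weights)  # tensor([[0.2500, 0.2500, 0.2500, 0.2500]])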

csrc/flashinfer_xqa_binding.cu (1)

19-21: Prefer a typed precision enum over a new boolean flag.

Using a bool run_fp8_mha flag does not scale. Replace it with a small enum (e.g., an int32_t precision code: {bf16, fp16, fp8}) and, similarly, pass the cache-element and compute dtypes as enums instead of separate flags. This reduces combinatorial overload and ABI churn.
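
As a sketch of what the Python side of such an enum could look like (names and codes are illustrative, not part of the current binding):

from enum import IntEnum

class MhaPrecision(IntEnum):
    BF16 = 0
    FP16 = 1
    FP8 = 2

# The wrapper would then pass e.g. int(MhaPrecision.FP8) across the FFI boundary
# instead of a growing set of boolean flags such as run_fp8_mha.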

tests/attention/test_xqa.py (1)

241-246: Good: FP8 cache scaling is documented.

Comment explains the 4.0 factor and overflow concerns.

🧹 Nitpick comments (10)
csrc/xqa/tma.h (1)

74-83: Comment/code mismatch for nbDims==1 path.

The comment says “nbDims==1 does not need tensormap,” but the code uses the tensor.1d variant taking a tensor map. Either drop the map for 1D linear copies or update the comment.

Also applies to: 129-138

csrc/xqa/gmma.cuh (1)

27-56: Bitfield layout is implementation-defined; prefer explicit packing.

Relying on 64‑bit bitfield layout and reinterpret_cast to Raw can be brittle across compilers/ABIs. Recommend encoding/decoding with shifts/masks into a uint64_t to guarantee layout and endianness. Keep sizeof(MatDesc)==8 as a guard.

csrc/xqa/tensorMap.h (1)

3-3: cuda.h include: make header robust to non-CUDA analysis/compiles.

Static analysis flagged ‘cuda.h’ not found. If this header is transitively included by non‑CUDA TU(s), guard the include or move these declarations behind a build flag. Example: wrap with a small shim header included only from .cpp, or add a dedicated config that ensures CUDA include paths are present in CI.

csrc/xqa/tensorMap.cpp (1)

43-73: Tensor map for contiguous KV cache looks correct.

The function properly constructs a tensor map for contiguous KV cache layout:

  • Global dimensions and strides are configured appropriately
  • Swizzle selection based on cache line size (128B or 64B)
  • Error handling via checkCu wrapper

Minor suggestion: The error message on line 64 "unsupported cache head size" could be more specific about expected values.

-        throw std::runtime_error("unsupported cache head size");
+        throw std::runtime_error("unsupported partElems: " + std::to_string(partElems) + 
+                                 ", expected 128 or 64");
flashinfer/jit/xqa.py (1)

76-100: SM version selection and build configuration verified; optional refactor still recommended for clarity and error handling.

The changes are correct:

  • New source files (mha_sm90.cu, tensorMap.cpp) exist in csrc/xqa/
  • Build configuration properly references and links them
  • Required CUDA Driver API linker flag (-lcuda) and cache layout flag included

However, the SM version selection logic could be improved for maintainability. The current code defaults to sm90a_nvcc_flags for unrecognized versions, which implicitly handles sm_version=90 but obscures intent and provides no validation for truly unsupported architectures.

Consider making the SM90 case explicit and adding validation:

-    if sm_version == 100:
+    if sm_version == 90:
+        sm_nvcc_flags = sm90a_nvcc_flags
+    elif sm_version == 100:
         sm_nvcc_flags = sm100a_nvcc_flags
     elif sm_version == 120:
         sm_nvcc_flags = sm120a_nvcc_flags
     else:
-        sm_nvcc_flags = sm90a_nvcc_flags
+        raise ValueError(f"Unsupported sm_version: {sm_version}")

This makes supported architectures explicit and catches invalid SM versions early.

csrc/flashinfer_xqa_binding.cu (1)

25-35: KV cache params behind preprocessor guards: keep Python and C++ signatures locked.

Since params differ when PAGED_KV_CACHE_LAYOUT!=1, ensure JIT always defines it (the JIT path does) and document this contract near the binding to avoid accidental ABI mismatches. Consider adding a static assert/log print on init when it’s not set.

Also applies to: 28-29

tests/attention/test_xqa.py (3)

263-275: Remove unused beam_width arg (lint: ARG001).

beam_width in cache_head_at is unused; drop it and update call sites.

-def cache_head_at(
+def cache_head_at(
     batch,
     is_k,
     idx_kv_head,
     pos,
-    cache_k_heads,
-    cache_v_heads,
-    page_list,
-    beam_width,
+    cache_k_heads,
+    cache_v_heads,
+    page_list,
     nb_k_heads,
     tokens_per_page,
 ):
@@
-                    cache_head = cache_head_at(
+                    cache_head = cache_head_at(
                         batch,
                         kv == 0,
                         idx_kv_head,
                         pos,
                         cache_k_heads,
                         cache_v_heads,
-                        page_list_arg,
-                        beam_width,
+                        page_list_arg,
                         nb_k_heads,
                         tokens_per_page,
                     )

Also applies to: 291-303


317-319: Make scratch size configurable; 256 MiB may OOM CI.

Read from an env var with a sane default to reduce flakiness.

-    scratch_size = 256 << 20
+    import os
+    scratch_mb = int(os.environ.get("FLASHINFER_TEST_SCRATCH_MB", "256"))
+    scratch_size = scratch_mb << 20

You can validate with different values in CI matrix.


392-397: Stable epsilon for relative diff.

Optional: use dtype-aware epsilon via torch.finfo to avoid hard-coded 1e-8.

-                diff_rel = diff_abs / (torch.abs(ref_output) + 1e-8)
+                eps = torch.finfo(torch.float32).eps
+                diff_rel = diff_abs / (torch.abs(ref_output) + eps)
flashinfer/xqa.py (1)

147-150: Avoid repeated capability queries and shorten the error.

Cache CC once and use a shorter exception message (addresses Ruff TRY003).

-    if get_compute_capability(torch.device(device="cuda"))[0] not in [9, 10, 12]:
-        raise RuntimeError("XQA is only supported on SM90, SM100, SM120 GPUs")
-    sm_version = int(get_compute_capability(torch.device(device="cuda"))[0] * 10)
+    cc_major, _ = get_compute_capability(torch.device(device="cuda"))
+    if cc_major not in (9, 10, 12):
+        raise RuntimeError("Unsupported GPU (require SM90/100/120)")
+    sm_version = int(cc_major * 10)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b55b26 and 9ef83d1.

📒 Files selected for processing (13)
  • csrc/flashinfer_xqa_binding.cu (1 hunks)
  • csrc/xqa/gmma.cuh (1 hunks)
  • csrc/xqa/mha.cu (5 hunks)
  • csrc/xqa/mha.h (2 hunks)
  • csrc/xqa/tensorMap.cpp (1 hunks)
  • csrc/xqa/tensorMap.h (1 hunks)
  • csrc/xqa/tma.h (1 hunks)
  • csrc/xqa/utils.cuh (2 hunks)
  • csrc/xqa/xqa_wrapper.cu (2 hunks)
  • flashinfer/aot.py (3 hunks)
  • flashinfer/jit/xqa.py (2 hunks)
  • flashinfer/xqa.py (6 hunks)
  • tests/attention/test_xqa.py (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (10)
csrc/xqa/tensorMap.h (1)
csrc/xqa/tensorMap.cpp (6)
  • getElemBytes (10-41)
  • getElemBytes (10-10)
  • makeTensorMapForContiguousKVCache (43-73)
  • makeTensorMapForContiguousKVCache (43-47)
  • makeTensorMapForPagedKVCache (75-117)
  • makeTensorMapForPagedKVCache (75-78)
csrc/xqa/tensorMap.cpp (1)
csrc/xqa/utils.h (1)
  • checkCu (39-48)
flashinfer/jit/xqa.py (1)
flashinfer/jit/core.py (2)
  • JitSpec (181-280)
  • gen_jit_spec (283-347)
flashinfer/xqa.py (3)
flashinfer/jit/xqa.py (1)
  • gen_xqa_module (38-101)
flashinfer/jit/core.py (1)
  • build_and_load (268-280)
flashinfer/utils.py (3)
  • register_custom_op (266-275)
  • register_custom_op (285-304)
  • get_compute_capability (245-248)
csrc/xqa/xqa_wrapper.cu (2)
csrc/xqa/mha_sm90.cu (4)
  • launchHopperF8MHAFlashInfer (3168-3275)
  • launchHopperF8MHAFlashInfer (3168-3185)
  • scratch (506-513)
  • scratch (506-506)
csrc/xqa/mha.cu (2)
  • launchMHAFlashInfer (2657-2749)
  • launchMHAFlashInfer (2657-2674)
flashinfer/aot.py (1)
flashinfer/jit/core.py (1)
  • JitSpec (181-280)
csrc/xqa/mha.h (1)
csrc/xqa/mha_sm90.cu (4)
  • launchHopperF8MHAFlashInfer (3168-3275)
  • launchHopperF8MHAFlashInfer (3168-3185)
  • scratch (506-513)
  • scratch (506-506)
tests/attention/test_xqa.py (1)
flashinfer/utils.py (1)
  • get_compute_capability (245-248)
csrc/xqa/tma.h (1)
csrc/xqa/mha_sm90.cu (16)
  • void (548-577)
  • void (579-584)
  • void (588-598)
  • void (1693-1727)
  • void (1765-1797)
  • void (1799-1816)
  • void (1841-1887)
  • void (1976-1997)
  • void (1999-2017)
  • void (2049-2131)
  • void (2180-2254)
  • void (2256-2275)
  • void (2278-2296)
  • void (2316-2332)
  • void (2336-2359)
  • void (2396-2420)
csrc/flashinfer_xqa_binding.cu (1)
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu (4)
  • output (230-396)
  • output (230-238)
  • output (398-566)
  • output (398-409)
🪛 Clang (14.0.6)
csrc/xqa/tensorMap.h

[error] 3-3: 'cuda.h' file not found

(clang-diagnostic-error)

🪛 Ruff (0.14.0)
flashinfer/xqa.py

148-148: Avoid specifying long messages outside the exception class

(TRY003)

tests/attention/test_xqa.py

271-271: Unused function argument: beam_width

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (21)
csrc/xqa/utils.cuh (2)

34-41: Numerical-stability note: initialize rowMax safely but validate ranges seen in practice.

Lowering safeInitRowMax to -1e5 avoids FMA overflow in x*log2e - bias, but it changes the effective lower bound. Please validate on adversarial logits (very negative rows) to ensure no early saturation and no accuracy regressions. Consider guarding the optimization per-arch or switching to computing (x - rowMax) before scaling to avoid FMA on large magnitudes.
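
A small illustration of the scaling-order concern (values chosen for demonstration only): with a full-range init value, multiplying by log2(e) before subtracting the row max overflows float32, while subtracting first stays finite.

import math
import torch

log2e = math.log2(math.e)
x = torch.tensor(-torch.finfo(torch.float32).max)  # an "effectively -inf" row init
row_max = x.clone()

print(x * log2e)              # -inf: scaling first overflows float32
print((x - row_max) * log2e)  # 0.0: subtracting the row max first keeps magnitudes small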


49-51: Code is correct; review comment contains incorrect assumptions.

SM100 (Blackwell) opt-in dynamic shared memory per block is 227 KB, which matches the value at line 50. The SM120 limit of 99 KB is already correctly configured on line 46, not on lines 49-51 as the review suggests.

The conditional structure properly segregates architectures:

  • Line 45-46: SM120 (__CUDA_ARCH__ == 1200) → 99 KB ✓
  • Line 49-50: SM100 (__CUDA_ARCH__ == 1000) → 227 KB ✓

Lines 49-51 handle only SM90 and SM100; SM120 is on a separate branch.

Likely an incorrect or invalid review comment.

csrc/xqa/gmma.cuh (1)

60-66: Assumption: shared address fits 18 bits (0x3FFFF) — verify on SM100/SM120.

The assert/encode mask assumes __cvta_generic_to_shared(data) fits 18 bits (>>4 stored). Please validate against SM100/SM120 SMEM addressing; increase field width or add debug fallback if this trips on larger SMEM configs.

csrc/xqa/mha.h (2)

138-150: API surface change verified — no action required.

Declaration (mha.h:133), definition (mha.cu:2657), and the single call site (xqa_wrapper.cu:43–68) all use identical #if PAGED_KV_CACHE_LAYOUT == 1 preprocessor guards. Arguments are correctly conditionalized at the call site to match the conditional signature. Python bindings invoke through xqa_wrapper, which handles the dispatch internally, isolating the FFI boundary from the conditional signature. ODR/ABI consistency is maintained within the compilation unit.


194-211: All concerns verified successfully.

The function declaration (mha.h:194) and definition (mha_sm90.cu:3168) match exactly, including all conditional parameters (LOW_PREC_OUTPUT, PAGED_KV_CACHE_LAYOUT, SPEC_DEC). The function is properly integrated: it's used in xqa_wrapper.cu:43 with correct FP8 conditional logic (run_fp8_mha ? &launchHopperF8MHAFlashInfer : &launchMHAFlashInfer), defined in the correct TU (mha_sm90.cu), and linked properly without issues.

csrc/xqa/mha.cu (2)

95-95: LGTM: SM100 architecture support added correctly.

The addition of __CUDA_ARCH__ == 1000 appropriately extends support to SM100 devices with the same configuration as __CUDA_ARCH__ 800/870/900 (SM80/SM87/SM90), using the larger kHeadPartBytes (128) and cacheVTileSeqLen (64).


2657-2722: Based on my analysis of the kernel signature and the function call, there is a parameter mismatch in the launchMHAFlashInfer function call when SPEC_DEC is defined.

Looking at the kernel_mha signature (mha_sm90.cu:615-652), the kernel expects:

  • nbKHeads (always)
  • slidingWinSize (if SLIDING_WINDOW)
  • qScale, output, rcpOutScale, q, attentionSinks, cacheList
  • beamSearchParams (if USE_BEAM_SEARCH)
  • batchSize, kvCacheScale
  • Tensor maps via grid_constant (not passed as regular parameters)
  • specDecParams (if SPEC_DEC)
  • semaphores, scratch

However, at line 2707-2722, when SPEC_DEC is defined, the call passes qSeqLen, nbKHeads, headGrpSize, qCuSeqLens as four separate parameters, but the kernel expects only nbKHeads at that position. Additionally, the call passes mask (line 2722) but the kernel has no mask parameter—it expects specDecParams instead.

The review comment requires verification of how SpecDecParams and BeamSearchParams should be constructed and passed, since the current call site appears to pass individual fields separately rather than properly constructed structs.

flashinfer/jit/xqa.py (4)

18-24: LGTM: Imports updated appropriately.

The added imports for SM-specific NVCC flags enable proper multi-architecture support.


26-35: LGTM: NVCC flags configured correctly.

The flags properly enable paged KV cache with layout 1, consistent with the conditional compilation paths in the C++ code.


47-55: LGTM: Flag generation logic is correct.

The conditional flag generation properly handles:

  • FP16 vs BF16 input (DTYPE and INPUT_FP16)
  • FP8 vs FP16/BF16 KV cache (CACHE_ELEM_ENUM)
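
A hedged sketch of this flag construction (the exact flag values are illustrative; see flashinfer/jit/xqa.py for the real ones):

def make_xqa_flags(fp16_input: bool, fp8_kv_cache: bool) -> list[str]:
    flags = []
    # Input dtype: fp16 vs bf16.
    flags.append(f"-DINPUT_FP16={int(fp16_input)}")
    flags.append("-DDTYPE=__half" if fp16_input else "-DDTYPE=__nv_bfloat16")
    # KV cache element type: CACHE_ELEM_ENUM selects fp8 vs fp16/bf16 storage.
    flags.append("-DCACHE_ELEM_ENUM=2" if fp8_kv_cache else "-DCACHE_ELEM_ENUM=0")
    return flags

print(make_xqa_flags(True, True))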

38-46: All call sites are already updated with the new signature.

Verification confirms that:

  • The new fp16_input, fp8_kv_cache, and sm_version parameters are consistently used across the codebase
  • Both call sites (flashinfer/aot.py:404 and flashinfer/xqa.py:40) correctly pass the new parameters
  • Wrapper functions (get_xqa_module and xqa) use the updated signature
  • No use_fp16 parameter exists anywhere in the codebase

The API changes are complete and properly integrated.

csrc/xqa/xqa_wrapper.cu (2)

22-38: LGTM: Function signature updated appropriately.

The signature changes are well-designed:

  • run_fp8_mha parameter enables runtime selection between FP8 and standard MHA
  • Optional<TensorView> for attentionSinks is more idiomatic than raw pointers
  • Conditional KV cache parameters based on PAGED_KV_CACHE_LAYOUT properly support both layout modes

39-65: Function pointer approach is type-safe; signatures are compatible.

Verification confirms that both launchHopperF8MHAFlashInfer and launchMHAFlashInfer have identical signatures with matching conditional compilation blocks and parameter lists, making the function pointer assignment safe and correct.

flashinfer/aot.py (3)

358-372: LGTM: gen_xqa signature updated for multi-architecture support.

The function signature changes are consistent with the JIT module updates:

  • Parameter renaming improves clarity
  • SM gating ensures generation only when supported architectures are available
  • New fp8_kv_cache_ parameter enables FP8 KV cache configurations

373-412: Multi-SM architecture support implemented correctly.

The iteration logic properly:

  • Constructs sm_versions list based on available architectures
  • Iterates over SM versions along with other configuration parameters
  • Validates configurations before generating modules
  • Passes all parameters to gen_xqa_module consistently
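
An illustrative sketch (assumed names) of the iteration pattern; the real gen_xqa in flashinfer/aot.py applies its own validity checks before emitting a spec:

from itertools import product

def iter_xqa_configs(has_sm90: bool, has_sm100: bool, has_sm120: bool):
    sm_versions = [v for v, present in ((90, has_sm90), (100, has_sm100), (120, has_sm120)) if present]
    for sm_version, fp16_input, fp8_kv_cache in product(sm_versions, (False, True), (False, True)):
        # Placeholder for configuration validation before module generation.
        yield {"sm_version": sm_version, "fp16_input": fp16_input, "fp8_kv_cache": fp8_kv_cache}

print(list(iter_xqa_configs(True, False, True)))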

527-546: LGTM: gen_all_modules updated consistently.

The changes to gen_all_modules properly wire through the new parameters and SM version support to the XQA generator.

csrc/xqa/tensorMap.cpp (2)

10-41: LGTM: Data type size lookup implemented correctly.

The getElemBytes function provides comprehensive coverage of CUDA tensor map data types with appropriate error handling.


75-117: Paged KV cache tensor map correctly supports two layout modes with consistent stride calculations.

The implementation correctly configures tensor map dimensions and strides for two distinct layouts:

  • VLLM Layout (PAGED_KV_CACHE_LAYOUT == 1): dimensions {headElems, nbKHeads, tokensPerPage, pages} with strides accounting for head-first ordering
  • XQA Layout (PAGED_KV_CACHE_LAYOUT == 0, default): dimensions {headElems, tokensPerPage, nbKHeads, pages} with strides accounting for token-first ordering

The dimension ordering aligns with memory access patterns throughout the codebase (verified in mha.cu, mhaUtils.cuh, and mha_sm90.cu). Both layouts apply the same swizzle modes and error handling. No issues identified.
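
A small sketch of the element-stride math implied by those two orderings, assuming the tensor-map convention of listing the innermost (fastest-varying) dimension first; names are illustrative:

def paged_kv_strides(head_elems: int, nb_k_heads: int, tokens_per_page: int,
                     vllm_layout: bool) -> dict[str, int]:
    if vllm_layout:
        # dims {headElems, nbKHeads, tokensPerPage, pages}: heads vary faster than tokens
        return {"elem": 1,
                "head": head_elems,
                "token": head_elems * nb_k_heads,
                "page": head_elems * nb_k_heads * tokens_per_page}
    # dims {headElems, tokensPerPage, nbKHeads, pages}: tokens vary faster than heads
    return {"elem": 1,
            "token": head_elems,
            "head": head_elems * tokens_per_page,
            "page": head_elems * tokens_per_page * nb_k_heads}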

csrc/flashinfer_xqa_binding.cu (2)

24-25: Good: Optional attention sinks.

Switching to tvm::ffi::Optional<TensorView> makes the API safer and clearer.


21-23: No changes needed; LOW_PREC_OUTPUT=0 is already set in compilation flags.

The codebase already includes "-DLOW_PREC_OUTPUT=0" in the xqa_nvcc_flags list within flashinfer/jit/xqa.py. This flag is passed to extra_cuda_cflags in the gen_jit_spec() call, ensuring the rcpOutScale parameter is not included in the C++ function signature. There is no ABI drift risk because the conditional parameter is compiled out consistently.

flashinfer/xqa.py (1)

50-72: Signature wiring looks consistent with the binding.

Param order matches xqa_wrapper (including run_fp8_mha, optional attentionSinks, and separate K/V caches).

If LOW_PREC_OUTPUT is ever enabled, extend these call sites to pass rcpOutScale or force -DLOW_PREC_OUTPUT=0 in JIT.

Also applies to: 73-91

Comment on lines 180 to 183
compute_capability = get_compute_capability(torch.device(device="cuda"))
if compute_capability[0] != 9:
pytest.skip("XQA only supports on Hopper at this moment")
if compute_capability[0] != 9 and run_fp8_mha:
pytest.skip("XQA supports fp8 mha only on Hopper GPUs")
set_random_seed(42)

🛠️ Refactor suggestion | 🟠 Major

Compute sm_count inside the test.

Set SM count after capability checks to avoid premature CUDA access.

 def test_xqa(
@@
-    compute_capability = get_compute_capability(torch.device(device="cuda"))
+    compute_capability = get_compute_capability(torch.device(device="cuda"))
     if compute_capability[0] != 9 and run_fp8_mha:
         pytest.skip("XQA supports fp8 mha only on Hopper GPUs")
     set_random_seed(42)
+    props = torch.cuda.get_device_properties(torch.cuda.current_device())
+    sm_count = props.multi_processor_count

Also applies to: 329-330

🤖 Prompt for AI Agents
In tests/attention/test_xqa.py around lines 180-183, the test currently calls
into CUDA to compute sm_count before checking compute capability and may access
CUDA prematurely; move the sm_count computation so it runs after the
compute_capability check and any pytest.skip decision (i.e., compute sm_count
only after verifying compute_capability and run_fp8_mha), and apply the same
change to the other occurrence around lines 329-330; ensure you call the
sm_count helper (or get_sm_count) with the CUDA device only after the skip logic
and after set_random_seed(42) if that ordering is required.

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>